Writing a function

The function template is a useful way to start writing a function:

my_fun <- function(arg1, arg2) {
# body
}

my_fun is the variable that you want to assign your function to, arg1 and arg2 are arguments to the function. The template has two arguments, but you can specify any number of arguments, each separated by a comma. You then replace # body with the R code that your function will execute, referring to the inputs by the argument names you specified.

# Define ratio() function
ratio <- function(x, y) {
  x / y
}

# Call ratio() with arguments 3 and 4
ratio(3, 4)
## [1] 0.75

Arguments

How did you call your function ratio() in the previous exercise? Do you remember the two ways to specify the arguments? (If you have forgotten it might be useful to review the video from Intermediate R).

You probably either did ratio(3, 4), which relies on matching by position, or ratio(x = 3, y = 4), which relies on matching by name.

For functions you and others use often, it’s okay to use positional matching for the first one or two arguments. These are usually the data to be computed on. Good examples are the x argument to the summary functions (mean(), sd(), etc.) and the x and y arguments to plotting functions.

However, beyond the first couple of arguments you should always use matching by name. It makes your code much easier for you and others to read. This is particularly important if the argument is optional, because it has a default. When overriding a default value, it’s good practice to use the name.

Notice that when you call a function, you should place a space around = in function calls, and always put a space after a comma, not before (just like in regular English). Using whitespace makes it easier to skim the function for the important components.

# Rewrite the call to follow best practices
mean(c(1:9, NA), trim = 0.1, na.rm = TRUE)
## [1] 5

Subsetting lists

There are a few ways to subset a list. Throughout the course we’ll mostly use double bracket ([[]]) subsetting by index and by name.

That is, my_list[[1]] extracts the first element of the list my_list, and my_list[[“name”]] extracts the element in my_list that is called name. If the list is nested you can travel down the hierarchy by recursive subsetting. For example, mylist[[1]][[“name”]] is the element called name inside the first element of my_list.

A data frame is just a special kind of list, so you can use double bracket subsetting on data frames too. my_df[[1]] will extract the first column of a data frame and my_df[["name"]] will extract the column named name from the data frame.

I’ve set up a list called tricky_list in your workspace. Use the function typeof() combined with double bracket subsetting to answer the following questions.

rm(list = ls())

# Create a list
tricky_list = list(
  nums = c(-1.13171455, -0.90888158, -0.86307629, -0.48236520, -0.46887554,
         -0.67455525, 0.06206395, 0.16914475, 0.84829313, -0.14515730),
  y    = c(FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, TRUE, TRUE, TRUE),
  x    = list("hello!", "hi!", "goodbye!", "bye!"),
  model = lm(mpg ~ wt, data = mtcars))

tricky_list
## $nums
##  [1] -1.13171455 -0.90888158 -0.86307629 -0.48236520 -0.46887554
##  [6] -0.67455525  0.06206395  0.16914475  0.84829313 -0.14515730
## 
## $y
##  [1] FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE
## 
## $x
## $x[[1]]
## [1] "hello!"
## 
## $x[[2]]
## [1] "hi!"
## 
## $x[[3]]
## [1] "goodbye!"
## 
## $x[[4]]
## [1] "bye!"
## 
## 
## $model
## 
## Call:
## lm(formula = mpg ~ wt, data = mtcars)
## 
## Coefficients:
## (Intercept)           wt  
##      37.285       -5.344
# 2nd element in tricky_list
typeof(tricky_list[[2]])
## [1] "logical"
# Element called x in tricky_list
typeof(tricky_list[["x"]])
## [1] "list"
# 2nd element inside the element called x in tricky_list
typeof(tricky_list[["x"]][[2]])
## [1] "character"

Exploring lists

Often you won’t know exactly what is inside a list. But, you’ll need to figure it out to get some useful piece of data. Extracting elements from the output of the names() and str() functions is a great way to explore the structure of a list.

Calling names() on a list will give you names at the top level of the list and str() will give you a full description of the entire list (which can sometimes be a little overwhelming).

tricky_list has a regression model stored in it. Let’s see if we can drill down and pull out the slope estimate corresponding to the wt variable.

# Guess where the regression model is stored
names(tricky_list)
## [1] "nums"  "y"     "x"     "model"
# Use names() and str() on the model element
names(tricky_list[["model"]])
##  [1] "coefficients"  "residuals"     "effects"       "rank"         
##  [5] "fitted.values" "assign"        "qr"            "df.residual"  
##  [9] "xlevels"       "call"          "terms"         "model"
str(tricky_list[["model"]])
## List of 12
##  $ coefficients : Named num [1:2] 37.29 -5.34
##   ..- attr(*, "names")= chr [1:2] "(Intercept)" "wt"
##  $ residuals    : Named num [1:32] -2.28 -0.92 -2.09 1.3 -0.2 ...
##   ..- attr(*, "names")= chr [1:32] "Mazda RX4" "Mazda RX4 Wag" "Datsun 710" "Hornet 4 Drive" ...
##  $ effects      : Named num [1:32] -113.65 -29.116 -1.661 1.631 0.111 ...
##   ..- attr(*, "names")= chr [1:32] "(Intercept)" "wt" "" "" ...
##  $ rank         : int 2
##  $ fitted.values: Named num [1:32] 23.3 21.9 24.9 20.1 18.9 ...
##   ..- attr(*, "names")= chr [1:32] "Mazda RX4" "Mazda RX4 Wag" "Datsun 710" "Hornet 4 Drive" ...
##  $ assign       : int [1:2] 0 1
##  $ qr           :List of 5
##   ..$ qr   : num [1:32, 1:2] -5.657 0.177 0.177 0.177 0.177 ...
##   .. ..- attr(*, "dimnames")=List of 2
##   .. .. ..$ : chr [1:32] "Mazda RX4" "Mazda RX4 Wag" "Datsun 710" "Hornet 4 Drive" ...
##   .. .. ..$ : chr [1:2] "(Intercept)" "wt"
##   .. ..- attr(*, "assign")= int [1:2] 0 1
##   ..$ qraux: num [1:2] 1.18 1.05
##   ..$ pivot: int [1:2] 1 2
##   ..$ tol  : num 1e-07
##   ..$ rank : int 2
##   ..- attr(*, "class")= chr "qr"
##  $ df.residual  : int 30
##  $ xlevels      : Named list()
##  $ call         : language lm(formula = mpg ~ wt, data = mtcars)
##  $ terms        :Classes 'terms', 'formula'  language mpg ~ wt
##   .. ..- attr(*, "variables")= language list(mpg, wt)
##   .. ..- attr(*, "factors")= int [1:2, 1] 0 1
##   .. .. ..- attr(*, "dimnames")=List of 2
##   .. .. .. ..$ : chr [1:2] "mpg" "wt"
##   .. .. .. ..$ : chr "wt"
##   .. ..- attr(*, "term.labels")= chr "wt"
##   .. ..- attr(*, "order")= int 1
##   .. ..- attr(*, "intercept")= int 1
##   .. ..- attr(*, "response")= int 1
##   .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv> 
##   .. ..- attr(*, "predvars")= language list(mpg, wt)
##   .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric"
##   .. .. ..- attr(*, "names")= chr [1:2] "mpg" "wt"
##  $ model        :'data.frame':   32 obs. of  2 variables:
##   ..$ mpg: num [1:32] 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##   ..$ wt : num [1:32] 2.62 2.88 2.32 3.21 3.44 ...
##   ..- attr(*, "terms")=Classes 'terms', 'formula'  language mpg ~ wt
##   .. .. ..- attr(*, "variables")= language list(mpg, wt)
##   .. .. ..- attr(*, "factors")= int [1:2, 1] 0 1
##   .. .. .. ..- attr(*, "dimnames")=List of 2
##   .. .. .. .. ..$ : chr [1:2] "mpg" "wt"
##   .. .. .. .. ..$ : chr "wt"
##   .. .. ..- attr(*, "term.labels")= chr "wt"
##   .. .. ..- attr(*, "order")= int 1
##   .. .. ..- attr(*, "intercept")= int 1
##   .. .. ..- attr(*, "response")= int 1
##   .. .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv> 
##   .. .. ..- attr(*, "predvars")= language list(mpg, wt)
##   .. .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric"
##   .. .. .. ..- attr(*, "names")= chr [1:2] "mpg" "wt"
##  - attr(*, "class")= chr "lm"
# Subset the coefficients element
tricky_list[["model"]][["coefficients"]]
## (Intercept)          wt 
##   37.285126   -5.344472
# Subset the wt element
tricky_list[["model"]][["coefficients"]][["wt"]]
## [1] -5.344472

Another way is to use the coef() function on the model element of tricky_list.

A safer way to create the sequence

Let’s take a look at the sequence component of our for loop:

i in 1:ncol(df)

Each time our for loop iterates, i takes the next value in 1:ncol(df). This is a pretty common model for a sequence: a sequence of consecutive integers designed to index over one dimension of our data.

What might surprise you is that this isn’t the best way to generate such a sequence, especially when you are using for loops inside your own functions. Let’s look at an example where df is an empty data frame:

df <- data.frame()
1:ncol(df)

for (i in 1:ncol(df)) {
  print(median(df[[i]]))
}

Our sequence is now the somewhat non-sensical: 1, 0. You might think you wouldn’t be silly enough to use a for loop with an empty data frame, but once you start writing your own functions, there’s no telling what the input will be.

A better method is to use the seq_along() function. This function generates a sequence along the index of the object passed to it, but handles the empty case much better.

df <- data.frame(
  a = rnorm(10),
  b = rnorm(10),
  c = rnorm(10),
  d = rnorm(10)
) 

# Replace the 1:ncol(df) sequence
for (i in seq_along(df)) {
  print(median(df[[i]]))
}
## [1] -0.2593348
## [1] 0.2568551
## [1] -0.3548578
## [1] -0.5958872
# Create an empty data frame
empty_df <- data.frame()

# Repeat for loop to verify there is no error
for (i in seq_along(empty_df)) {
  print(median(empty_df[[i]]))
}

Keeping output

Our for loop does a good job displaying the column medians, but we might want to store these medians in a vector for future use.

Before you start the loop, you must always allocate sufficient space for the output, let’s say an object called output. This is very important for efficiency: if you grow the for loop at each iteration (e.g. using c()), your for loop will be very slow.

A general way of creating an empty vector of given length is the vector() function. It has two arguments: the type of the vector (“logical”, “integer”, “double”, “character”, etc.) and the length of the vector.

Then, at each iteration of the loop you must store the output in the corresponding entry of the output vector, i.e. assign the result to output[[i]]. (You might ask why we are using double brackets here when output is a vector. It’s primarily for generalizability: this subsetting will work whether output is a vector or a list.)

# Create new double vector: output
output <- vector("double", ncol(df))

# Alter the loop
for (i in seq_along(df)) {
  # Change code to store result in output
  output[[i]] = median(df[[i]])
}

# Print output
output
## [1] -0.2593348  0.2568551 -0.3548578 -0.5958872